Corpus-based Name Standardization

Author

Gerrit Bloothooft

Abstract

Variation in the spelling of names has various origins, many of which are difficult to describe by rule. We present a method that uses both rules and a probabilistic similarity measure, and which can make use of existing onomastic corpora. Rules first convert an unknown name to a semiphonemic form. A set of possible candidates is then selected from the onomastic corpus. For this set, the similarity to the unknown name is computed, and a decision procedure chooses the best candidate. If no specific onomastic corpus is available, the method provides a tool for clustering similar names. The method is demonstrated on a corpus of 49,193 first names from 18th-century parish registers, using an available Dutch corpus with 22,579 variants of 4,482 base forms of first names.
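The pipeline outlined in the abstract (rules, candidate selection, similarity, decision) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the rewrite rules in RULES, the toy CORPUS, the first-letter candidate filter, the 0.8 threshold, and the use of difflib's SequenceMatcher ratio as a stand-in for the probabilistic similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical rewrite rules (not the paper's): applied in order to
# reduce spelling variants to a rough "semiphonemic" key.
RULES = [("ph", "f"), ("th", "t"), ("ck", "k"), ("c", "k"), ("ij", "i"), ("y", "i")]

# Toy onomastic corpus (illustrative only): known variant -> base form.
CORPUS = {
    "johannes": "Johannes",
    "joannes": "Johannes",
    "jan": "Jan",
    "catharina": "Catharina",
    "katrijn": "Catharina",
    "cornelis": "Cornelis",
}


def semiphonemic(name):
    """Apply the rewrite rules, then collapse doubled letters."""
    key = name.lower()
    for old, new in RULES:
        key = key.replace(old, new)
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def standardize(name, corpus, threshold=0.8):
    """Return the base form of the most similar corpus variant, or None."""
    key = semiphonemic(name)
    best_base, best_score = None, 0.0
    for variant, base in corpus.items():
        variant_key = semiphonemic(variant)
        # Candidate selection (assumed): keys must share their first letter.
        if variant_key[:1] != key[:1]:
            continue
        # Similarity (stand-in for the paper's probabilistic measure).
        score = SequenceMatcher(None, key, variant_key).ratio()
        if score > best_score:
            best_base, best_score = base, score
    # Decision procedure (assumed): accept only above a fixed threshold.
    return best_base if best_score >= threshold else None


if __name__ == "__main__":
    for unknown in ["Katharina", "Johannis", "Xerxes"]:
        print(unknown, "->", standardize(unknown, CORPUS))
```

In this sketch a name with no sufficiently similar candidate is left unresolved (None) rather than forced onto the nearest base form; the abstract does not say how the actual decision procedure treats such cases.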


Similar resources

Globalization, Standardization, and Dialect Leveling in Iran

This paper is an attempt to shed light on the effects of modernization, urbanization, monolingual educational system, and mass media as well as the process of globalization on dialect leveling among Persian dialects. In so doing, the first part of the paper elaborates on the relationship between globalization and sociolinguistics, and on the concept of standardization. Also, it discusses some ...


Coreference resolution with deep learning in the Persian Language

Coreference resolution is an advanced issue in natural language processing. Nowadays, due to the extension of social networks, TV channels, news agencies, the Internet, etc. in human life, reading all the contents, analyzing them, and finding a relation between them require time and cost. In the present era, text analysis is performed using various natural language processing techniques, one ...


Content of Linguistic Annotation: Standards and Practices (CLASP) Research Activities and Findings

25 members of the computational linguistics research community participated in a meeting at New York University on November 7, 2009 to address several difficult questions about the standardization of linguistic content in corpus annotation, where we define the term standardization to include all efforts to improve compatibility or interoperability between annotation content, including not only ...


Grammatical Annotation of Historical Portuguese: Generating a Corpus-Based Diachronic Dictionary

In this paper, we present an automatic system for the morphosyntactic annotation and lexicographical evaluation of historical Portuguese corpora. Using rule-based orthographical normalization, we were able to apply a standard parser (PALAVRAS) to historical data (Colonia corpus) and to achieve accurate annotation for both POS and syntax. By aligning original and standardized word forms, our met...


Services integration, professional autonomy and standardization. Representations of standardization among case managers in the case of integrated service networks implementations

Purpose: Social work (SW) practices are undergoing major transformations generated by change in the governance of health and social policies. These transformations are based on two logics of performance, one managerial, resting on New Public Management principles, and another clinical, supported by evidence-based practices. The implementation of integrated services is traversed by these two logics ...



Journal title:

History and Computing

Volume 6

Publication date: 1994
